Speech-silence detection with subband coding.

Speech detection is accomplished in conjunction with
two-band subband encoding. A detection statistic $T(i\tau_0)$,
used to estimate the short-term speech energy, is developed
from energy estimates made in each subband. A speech
presence energy threshold $X_{ON}$ and a speech silence energy
threshold $X_{OFF}$ are computed, which adapt to the
long-term speech level. The detection statistic is compared
to the thresholds to make a decision concerning the presence
or absence of speech.

Also disclosed are considerations for extrapolating the
detection to result in an arrangement with more than two
subbands.

SPEECH-SILENCE DETECTION WITH SUBBAND CODING

Technical Field

The invention relates to signal processing
generally, and more particularly to means for detecting
intervals of silence in encoded speech.

Background of the Invention 

Normal human speech includes intervals of
silence which will be referred to herein as "speech
silence." When the speech is transmitted electronically,
such as in a communications network, the speech silence
occupies a significant portion of the total transmission
time. This leads to inefficient use of the communications
network, since the only information which is transmitted
during the course of the entire speech-silence interval, no
matter how long, is the existence of the interval and its
duration.

Efforts have been made to improve the efficiency
of transmission by inserting other information, such as
data, in the silence intervals on a time assignment basis.
Such an approach is presently used for transatlantic cable
and satellite communications in what are known as TASI (time
assignment speech interpolation) systems. A system of
this type is described, for instance, in U.S. Pat.
4,100,377.

Speech silence may be detected even in voice 
signals which have already been digitally encoded into a 
pulse code modulated (PCM) format. This is described, for 
example, in U.S. Pats. 3,909,532 and 4,449,190. 

Where both encoded speech and data signals share
a carrier on a time assignment basis, there is a need for a
high degree of accuracy in the determination of
speech-silence intervals in order to permit the maximum use of
the interval without degradation of the reconstructed speech.
Of primary interest in this regard, therefore, are
speech-silence boundaries. These are transitions either from
voice to silence or from silence to voice. Accordingly,
there is a need for speech-silence boundary detection with
improved accuracy.

Summary of the Invention

In accordance with the novel method and apparatus
of the present invention, speech-silence boundaries are
detected in the digitally encoded data of at least two
subbands of the speech signal. Energy estimates are made
for each of the frequency subbands for generating a
detection statistic to estimate short-term speech energy.
A threshold which is adapted to the long-term speech level
is computed. This threshold is compared to the detection
statistic to make a decision as to the presence of a
silence interval. The resulting detection has
significantly improved accuracy over detection using only
one frequency band.

Brief Description of the Drawing 

FIG. 1 is a functional block circuit diagram of a
two-band subband encoder with speech detection in
accordance with one example of the present invention.

FIG. 2 is a functional flow diagram showing in
more detail a speech statistic computation subunit of the
apparatus of FIG. 1.

FIG. 3 is a functional flow diagram showing in
more detail a threshold computation subunit of the
apparatus of FIG. 1.

FIG. 4 is a functional flow diagram showing in
more detail a speech determination subunit of the apparatus
of FIG. 1.

Detailed Description 

The two-band subband encoder 10 with speech
detection shown in FIG. 1 includes a lower frequency
subband, or low band encoding circuit 12 made up of a low
pass quadrature mirror filter 14, a by-two decimator 16,
and an ADPCM (adaptive differential pulse code modulation)
encoder 18. In parallel with the low band circuit 12 is a
higher frequency subband, or high band encoding circuit 20
made up of a high pass quadrature mirror filter 22, a
by-two decimator 21, and an ADPCM encoder 26. Both of the
encoding circuits 12, 20 receive the speech
input signal. They send their outputs to a multiplexer 28
for transmission. The details of circuits
such as the circuits 12, 20 are
known to those in the art and are described, for example, by
R. E. Crochiere in the Bell System Technical Journal, vol. 60,
No. 7, Part 2, pp. 1633-1653, Sept. 1981, and in "Digital
Voice Storage in a Microprocessor," by J. L. Flanagan,
J. D. Johnston, and J. W. Upton, IEEE Transactions on
Communications, Feb. 1982, vol. COM-30, pp. 336-345.

A speech detector 30, which includes a speech
threshold computing subunit 32, a speech statistic
computing subunit 34, and a determining subunit 36, is
adapted to provide an output to the multiplexer 28 which
results in insertion of a speech
indicator, or speech flag, in the transmitted output. The
input to the speech threshold computing subunit 32 is the
step size information from the low band encoder 12. The
input to the speech statistic computing subunit 34 is the
sample step size information from both the low band encoder
12 and the high band encoder 20. Both the threshold
subunit 32 and the statistic subunit 34 give their output
to the speech determining subunit 36.

The statistic computing subunit 34 is shown in
more detail in FIG. 2. [...] The log step-size parameters
are used as estimates of the speech in each band at a given
time.

Referring now to FIG. 2, the speech sampling
period is represented by $\tau_0$. The log of the step size in
the low band is represented by $d_L(i\tau_0)$, while the log
of the step size in the high band is represented by
$d_H(i\tau_0)$, at time $t = i\tau_0$. Let $T(i\tau_0)$ be the speech
detection statistic used to determine the speech level.
Let $\alpha_L$ and $\alpha_H$ be fixed weights associated with
$d_L(i\tau_0)$ and $d_H(i\tau_0)$, and let $\beta_{DS}$ be a fixed
weight such that $0 < \beta_{DS} < 1$. Then a detection statistic
$T(i\tau_0)$ can be computed as follows:

$$T(i\tau_0) = \beta_{DS}\,T[(i-1)\tau_0] + \alpha_L d_L(i\tau_0) + \alpha_H d_H(i\tau_0). \tag{1}$$

The detection statistic $T(i\tau_0)$ is smoothed to become a
low-pass filtered sum of speech information taken from each
subband. The weight $\beta_{DS}$ is chosen to give $T(i\tau_0)$ a
specific time constant which controls the necessary
smoothing of the information. A time constant of 16
milliseconds has been found to be suitable. The constants
$\alpha_L$ and $\alpha_H$ determine the relative weight given to each
subband. It has been found to be particularly advantageous
to set $\alpha_H$ at a value of about 1.5 to 2 times the value of
$\alpha_L$. This accentuates discrimination in the high subband,
which contains more information for the detection of
fricatives and other consonants. The values of these
constants for a particular application may be readily
determined by means of laboratory tests by one skilled in
the art.
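
The recursion of equation (1) can be sketched in a few lines of Python. This is an illustrative sketch, not the DSP implementation: the function names and example weights are hypothetical, and the mapping from a time constant to the leak weight (beta = exp(-period / time constant), a common first-order approximation) is an assumption, as is the use of a 12 kHz sample period with the 16 ms time constant.

```python
import math


def leak_for_time_constant(sample_period_s, time_constant_s):
    """Assumed mapping from a time constant to the leak weight beta_DS."""
    return math.exp(-sample_period_s / time_constant_s)


def detection_statistic(d_low, d_high, alpha_low, alpha_high, beta_ds, t_init=0.0):
    """Apply equation (1) to two sequences of log step sizes.

    T(i) = beta_ds * T(i-1) + alpha_low * d_L(i) + alpha_high * d_H(i)
    """
    t = t_init
    trace = []
    for dl, dh in zip(d_low, d_high):
        t = beta_ds * t + alpha_low * dl + alpha_high * dh
        trace.append(t)
    return trace


# Leak weight for a 16 ms time constant, assuming updates at 12 kHz.
beta = leak_for_time_constant(1.0 / 12000.0, 0.016)

# Toy input with an exaggerated leak (0.5) so the decay is easy to follow;
# alpha_high is twice alpha_low, as the text suggests.
trace = detection_statistic([1.0, 1.0, 0.0], [2.0, 0.0, 0.0],
                            alpha_low=1.0, alpha_high=2.0, beta_ds=0.5)
```

With the high-band weight set to twice the low-band weight, a given log step size in the high band moves the statistic twice as far as the same value in the low band.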

FIG. 3 shows the method of computing a speech
presence energy threshold $X_{ON}$ and a speech silence
energy threshold $X_{OFF}$. This method is very similar to
that used in ADPCM speech detection, using the log step
size $d_L(i\tau_0)$ from the lower subband only. $M(i\tau_0)$ is
the maximum of the values $\alpha_M d_L(i\tau_0)$; $\alpha_M$ is a
constant weight. Therefore, when $\alpha_M d_L(i\tau_0)$
increases, $M(i\tau_0)$ increases; when $\alpha_M d_L(i\tau_0)$
decreases, $M(i\tau_0)$ decreases only very slowly according to
the leak factor $\beta_M$. $M(i\tau_0)$ is restrained from
decreasing to less than its lower limit ($M_0$), so $M(i\tau_0)$
measures the maximum speech energy in the lower subband.

The variable $d'_L$ can be defined to be

$$d'_L(i\tau_0) = d_L(i\tau_0) + 32; \tag{2}$$

the bias of 32 is used to ensure that $d'_L$ and $M$ are
always positive. The value of $M$ at time $i\tau_0$ is

$$M(i\tau_0) = \max\{\beta_M M[(i-1)\tau_0],\; \alpha_M d'_L(i\tau_0),\; M_0\}. \tag{3}$$

The thresholds are fixed distances below $M$, so the
threshold $X_{ON}$, used to determine when speech changes
from OFF to ON, is computed as follows:

$$X_{ON}(i\tau_0) = M(i\tau_0) - C_{ON}; \tag{4}$$

the threshold $X_{OFF}$, used to determine when speech
changes from ON to OFF, is

$$X_{OFF}(i\tau_0) = M(i\tau_0) - C_{OFF}; \tag{5}$$

the values of $C_{ON}$ and $C_{OFF}$ are constants, with
$C_{OFF} > C_{ON}$.
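
Equations (2) through (5) amount to a leaky peak tracker followed by two fixed offsets. The sketch below illustrates that reading; the function names and all numeric values (weights, floor, offsets) are hypothetical and chosen only to make the behavior visible.

```python
def update_max_level(m_prev, d_low, alpha_m, beta_m, m_floor, bias=32):
    """One step of equations (2) and (3): leaky maximum of the low-band level."""
    d_biased = d_low + bias                      # equation (2)
    return max(beta_m * m_prev, alpha_m * d_biased, m_floor)  # equation (3)


def thresholds(m, c_on, c_off):
    """Equations (4) and (5): thresholds at fixed distances below M."""
    return m - c_on, m - c_off


# Exaggerated leak (0.5) so the slow decay shows up in three steps.
m = 10.0
levels = []
for d in (-2, -30, -30):
    m = update_max_level(m, d, alpha_m=1.0, beta_m=0.5, m_floor=10.0)
    levels.append(m)

x_on, x_off = thresholds(30.0, c_on=4.0, c_off=8.0)
```

Because $C_{OFF} > C_{ON}$, the OFF threshold always lies below the ON threshold, which is what gives the detector its hysteresis.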

FIG. 4 shows how the comparison is done. The
speech samples are divided into blocks of some convenient
length. (In this case 24 samples per block are used.)
Once per block, a decision is made concerning whether
speech is ON or OFF. If, in the previous block, speech was
on, then the OFF threshold is used; if speech was off, the
ON threshold is used. The switch in FIG. 4 chooses the
correct threshold, which is then compared to the detection
statistic. The speech flag is set ON or OFF depending on
whether the detection statistic is above or below the
threshold. Let $\tau_{DS}$ be the time interval associated
with one block. (In this case, $\tau_{DS} = 24\tau_0$.) Let $S$
denote the speech state with two possible values:

$$S = \begin{cases} 0, & \text{to indicate speech presence,} \\ 1, & \text{to indicate silence.} \end{cases} \tag{6}$$



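The speech state $S(i\tau_{DS})$ at time $t = i\tau_{DS}$ depends
on the previous speech state $S[(i-1)\tau_{DS}]$ as follows:
when $S[(i-1)\tau_{DS}] = 0$,

$$S(i\tau_{DS}) = \begin{cases} 0, & \text{if } T(i\tau_{DS}) > X_{OFF}(i\tau_{DS}); \\ 1, & \text{if } T(i\tau_{DS}) < X_{OFF}(i\tau_{DS}), \end{cases} \quad i = 1, 2, \ldots; \tag{7}$$

when $S[(i-1)\tau_{DS}] = 1$,

$$S(i\tau_{DS}) = \begin{cases} 0, & \text{if } T(i\tau_{DS}) > X_{ON}(i\tau_{DS}); \\ 1, & \text{if } T(i\tau_{DS}) < X_{ON}(i\tau_{DS}), \end{cases} \quad i = 1, 2, \ldots \tag{8}$$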
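
The two-threshold decision of equations (6) through (8) can be sketched as a small state machine. The names below are hypothetical, and one detail is an assumption: the equations do not say what happens when the statistic exactly equals the threshold, so this sketch simply keeps the previous state in that case.

```python
SPEECH, SILENCE = 0, 1  # state values from equation (6)


def next_state(prev_state, t_stat, x_on, x_off):
    """Equations (7) and (8): pick the threshold from the previous state."""
    threshold = x_off if prev_state == SPEECH else x_on
    if t_stat > threshold:
        return SPEECH
    if t_stat < threshold:
        return SILENCE
    return prev_state  # assumption: equality is not covered by the equations


def run_detector(t_values, x_on, x_off, initial=SILENCE):
    """Apply the per-block decision to a sequence of statistic values."""
    state = initial
    states = []
    for t_stat in t_values:
        state = next_state(state, t_stat, x_on, x_off)
        states.append(state)
    return states


# Illustrative values: x_on = 26 lies above x_off = 22.
states = run_detector([20, 24, 27, 24, 21], x_on=26, x_off=22)
```

In this toy run the value 24 is classified as silence while the detector is waiting for speech to start, and as speech once speech has started, so brief dips between the two thresholds do not toggle the flag.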



The system 10 can be effectively implemented by a
person of ordinary skill in the art of subband encoding by
appropriately adapting two or more digital signal processor
microcomputers. Such microcomputers are presently in use
and may include a memory unit, an arithmetic unit, a
control unit, an input-output unit, and a machine language
storage unit in a single VLSI circuit. Their function may
alternatively be provided by a combination of a number of
different interconnected VLSI circuits. One such
microcomputer which is suitable for implementing the system
10 is a DSP (Digital Signal Processor) manufactured by
AT&T Technologies, Inc., a corporation of New York, U.S.A.,
and described, for example, in the above-mentioned Bell
System Technical Journal volume.

In one example of a system implemented with two
DSPs, one DSP is used for the encoding and transmission of
speech, while the other DSP is used for the reception and
decoding of speech. External logic is used to interface
the PCM (pulse code modulation) bit streams of each DSP to
both analog-to-digital and digital-to-analog converters
for speech input and output. The DSP microcomputers also
perform speech-silence detection on the speech signal, so
that the silence intervals can be used to transmit
user-supplied data.

The DSP microcomputers determine the speech state
every two milliseconds. The transmitting DSP provides the
speech-state status for external circuitry and generates a
112-bit frame for transmission. The frame consists of a
3-bit framing pattern, a 1-bit speech flag, and 24 samples of
subband encoded speech. This speech is sampled at a 12 kHz
rate and encoded with 5-bit accuracy in the low band and
4-bit accuracy in the high band. When the DSP indicates the
speech flag is on, external line interface circuitry will
send the DSP-generated frame intact. When the speech flag
is off, the 24 samples of speech are replaced by 108 bits of
user-supplied data. After construction, the frame is sent
over a 56 kbps (kilobits per second) digital channel to
another terminal for decoding.
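
The frame arithmetic above can be checked with a short sketch. It assumes that the by-two decimators leave 12 code words per band for every 24 input samples, so that 12 x 5 + 12 x 4 = 108 payload bits. The framing pattern value, the flag polarity, and the interleaving of low-band and high-band codes are hypothetical, since the text does not specify them.

```python
FRAMING_PATTERN = "101"  # hypothetical value; the source gives only its length
PAYLOAD_BITS = 108       # 12 low-band codes x 5 bits + 12 high-band codes x 4 bits
FRAME_BITS = 3 + 1 + PAYLOAD_BITS


def pack_speech(low_codes, high_codes):
    """Pack 12 five-bit and 12 four-bit codes into a 108-character bit string."""
    if len(low_codes) != 12 or len(high_codes) != 12:
        raise ValueError("expected 12 codes per band")
    bits = []
    for lo, hi in zip(low_codes, high_codes):
        if not (0 <= lo < 32 and 0 <= hi < 16):
            raise ValueError("code out of range")
        bits.append(format(lo, "05b") + format(hi, "04b"))  # assumed interleaving
    return "".join(bits)


def build_frame(speech_on, payload):
    """Prefix the payload with the framing pattern and the speech flag."""
    if len(payload) != PAYLOAD_BITS:
        raise ValueError("payload must be 108 bits")
    return FRAMING_PATTERN + ("1" if speech_on else "0") + payload


def parse_frame(frame):
    """Return (speech_on, payload) for a frame aligned on the framing pattern."""
    if len(frame) != FRAME_BITS or not frame.startswith(FRAMING_PATTERN):
        raise ValueError("bad frame")
    return frame[3] == "1", frame[4:]


speech_payload = pack_speech(list(range(12)), list(range(12)))
frame = build_frame(True, speech_payload)
# One frame per 24 samples at 12 kHz: 112 bits every 2 ms.
bit_rate = FRAME_BITS * 12000 // 24
```

Under these assumptions the numbers are self-consistent: 112 bits every 2 ms is exactly the 56 kbps channel rate, and the speech payload has the same size as the 108 bits of user data that replace it during silence.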

In the receiver, a simple framing algorithm is
implemented with a combination of DSP firmware and external
line interface circuitry. The framing algorithm searches
the incoming 56 kbps signal to find the orientation of the
3-bit framing pattern. After the receiving DSP
synchronizes itself with the framing pattern, it reads the
speech state flag. If the speech state flag is present,
the DSP begins decoding the incoming speech signal for
listening, but if the flag is absent, the DSP signals
external circuitry to remove the data and send it to a user
interface. This pattern is repeated every two
milliseconds, as long as a valid framing pattern is
detected.

The equations above describe the general concepts
involved in determining the quantities needed by the speech
detector. Due to finite bit length and timing
considerations in the DSP, some of these equations are
preferably slightly modified. For example, the system 10
is based on a 24-sample frame, so every 24 samples a
decision is made as to whether speech is present. The
speech detection statistic is computed in this framework by
the DSP as follows:

$$T(i\tau_{DS}) = \beta_{DS}\,T[(i-1)\tau_{DS}] + \sum_{j=1}^{24} \left\{ \alpha_L d'_L[j\tau_0 + (i-1)\tau_{DS}] + \alpha_H d'_H[j\tau_0 + (i-1)\tau_{DS}] \right\}, \quad i = 1, 2, \ldots \tag{9}$$

So $T(i\tau_{DS})$ is updated each sample period by adding
$\alpha_L d'_L + \alpha_H d'_H$ to it, and it is leaked once per block
of 24 samples. The value of the maximum level $M$ must also
be computed slightly differently to obtain accurate results
with the DSP. Let $\tau_{MAX}$ be the time interval between
two successive points at which $M$ is leaked.
Experimentally, it was found that $\tau_{MAX} = 8$ seconds works
well. The equation for $M$ that may be implemented in the
DSP is



$$M(i\tau_0) = \max\{\alpha_M d'_L(i\tau_0),\; M[(i-1)\tau_0]\}, \quad \text{for } i\tau_0 \bmod \tau_{MAX} \neq 0,\; i = 1, 2, \ldots,$$

$$M(i\tau_0) = \max\{\beta_M \max\{\alpha_M d'_L(i\tau_0),\; M[(i-1)\tau_0]\},\; M_0\}, \quad \text{for } i\tau_0 \bmod \tau_{MAX} = 0,\; i = 1, 2, \ldots \tag{10}$$



The thresholds only need to be computed once per 24
samples, so that they can be used to detect the presence or
absence of speech.

$$X_{ON}(i\tau_{DS}) = M(i\tau_{DS}) - C_{ON}; \qquad X_{OFF}(i\tau_{DS}) = M(i\tau_{DS}) - C_{OFF}. \tag{11}$$

The speech state is determined in the same way as described
above by equations (6)-(8).
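
The block-oriented variants in equations (9) and (10) can be sketched as follows. This is an illustration under stated assumptions rather than the DSP firmware: the function names are hypothetical, the high-band value $d'_H$ is assumed to carry the same bias of 32 as $d'_L$ in equation (2), and the caller is assumed to decide when a leak point (a multiple of $\tau_{MAX}$) has been reached.

```python
def block_statistic(t_prev, d_low_block, d_high_block,
                    alpha_low, alpha_high, beta_ds, bias=32):
    """Equation (9): leak once per block, then accumulate every sample."""
    t = beta_ds * t_prev
    for dl, dh in zip(d_low_block, d_high_block):
        # bias on the high band is an assumption made by analogy with eq. (2)
        t += alpha_low * (dl + bias) + alpha_high * (dh + bias)
    return t


def update_max_dsp(m_prev, d_low, alpha_m, beta_m, m_floor, leak_now, bias=32):
    """Equation (10): plain running maximum, leaked only at leak points."""
    m = max(alpha_m * (d_low + bias), m_prev)
    if leak_now:
        m = max(beta_m * m, m_floor)
    return m


# Toy block of two samples instead of 24, with an exaggerated leak of 0.5.
t_block = block_statistic(8.0, [-32, -31], [-32, -30],
                          alpha_low=1.0, alpha_high=2.0, beta_ds=0.5)

m_hold = update_max_dsp(20.0, -2, 1.0, 0.5, 12.0, leak_now=False)
m_leak = update_max_dsp(20.0, -2, 1.0, 0.5, 12.0, leak_now=True)
m_floor_hit = update_max_dsp(20.0, -30, 1.0, 0.5, 12.0, leak_now=True)
```

If the maximum is updated at the 12 kHz sample rate, a leak interval of 8 seconds would correspond to one leak every 96,000 updates; that figure is derived here from the stated values and is not given in the text.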

This invention is not limited to two-band
subband coding. The detection statistic $T(i\tau_0)$ and
maximum level $M(i\tau_0)$ can include information from a
larger number of subbands, using equations similar to
equations (1)-(11) above. Silence detection with five-band
subband coding is an example of this. Let $d_j(i\tau_0)$,
for $1 \le j \le 5$, be the log step size values for each of
the five bands, let $\alpha_j$, $j = 1, \ldots, 5$, be fixed weights, and
let $\beta_{DS}$ be a leak factor slightly less than 1. In
analogy with equation (1), a general equation describing
the speech detection statistic is

$$T(i\tau_0) = \beta_{DS}\,T[(i-1)\tau_0] + \sum_{j=1}^{5} \alpha_j d_j(i\tau_0). \tag{12}$$

Letting $\mu_j$, $j = 1, \ldots, 5$, be fixed weights, and $\beta_M$ a fixed
leak factor slightly less than 1, the general equation for
the maximum level is

$$M(i\tau_0) = \max\left\{\beta_M M[(i-1)\tau_0],\; \sum_{j=1}^{5} \mu_j d_j(i\tau_0),\; M_0\right\}. \tag{13}$$



Some of the weighting factors $\alpha_j$ or $\mu_j$ could be zero. As
in equations (9)-(11), equations (12)-(13) can be slightly
altered to conform to a specific hardware implementation
such as an implementation using a DSP microprocessor. It
is also necessary to choose specific values of the
parameters in equations (12)-(13). For the computation of
the detection statistic, $\alpha_1 = \alpha_2$, and $\alpha_3 = \alpha_4 = 2\alpha_1$,
giving a greater weight to the higher frequency
bands; band 5 is not used, so $\alpha_5 = 0$. For the
computation of a maximum level, $\mu_1 = \mu_2$, and
$\mu_3 = \mu_4 = \mu_5 = 0$. The maximum level depends on the energy
in the low-frequency bands, giving a smooth long-term
average.
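
Equations (12) and (13) with the weight pattern just described can be sketched as below. The function names are hypothetical, and the base weight of 1.0 and all other numeric values are assumptions chosen for illustration; only the ratios between the weights come from the text.

```python
def multiband_statistic(t_prev, d, alpha, beta_ds):
    """Equation (12): leaky sum of weighted log step sizes over all bands."""
    if len(d) != len(alpha):
        raise ValueError("one weight per band is required")
    return beta_ds * t_prev + sum(a * x for a, x in zip(alpha, d))


def multiband_max(m_prev, d, mu, beta_m, m_floor):
    """Equation (13): leaky maximum of a weighted sum, with a lower limit."""
    if len(d) != len(mu):
        raise ValueError("one weight per band is required")
    weighted = sum(w * x for w, x in zip(mu, d))
    return max(beta_m * m_prev, weighted, m_floor)


# Weight pattern from the text, with an assumed base weight of 1.0:
# bands 3 and 4 count double, band 5 is unused in the statistic,
# and only bands 1 and 2 feed the maximum level.
ALPHA = [1.0, 1.0, 2.0, 2.0, 0.0]
MU = [1.0, 1.0, 0.0, 0.0, 0.0]

d_now = [1.0, 2.0, 3.0, 4.0, 5.0]
t_now = multiband_statistic(2.0, d_now, ALPHA, beta_ds=0.5)
m_now = multiband_max(10.0, d_now, MU, beta_m=0.5, m_floor=1.0)
```

Setting a weight to zero removes that band from the computation without changing the code path, which is one way to read the remark that some of the weighting factors could be zero.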

In theory, the equations (12) and (13) can be
extended to any number of bands. However, as the number of
bands increases, the time delay associated with computing
the detection statistic and maximum level also increases.
Therefore there is a practical limit to the number of bands
that can be used in this system.

1. Signal encoding apparatus
CHARACTERIZED BY

means for encoding a plurality of frequency
subband portions of a signal, including means for
generating voltage step size values for signal samples of
each subband;

means for computing speech statistic values
based on the voltage step size values for the one frequency
subband and the voltage step size values for another of the
frequency subbands; and

means for comparing speech presence energy
threshold values and speech silence energy threshold values
to the speech statistic values to selectively generate
speech presence output signals.

2. The apparatus defined in claim 1 wherein said
speech statistic value computing means is
CHARACTERIZED BY

means for multiplying the step size values of
each subband by a corresponding speech detection
coefficient to generate respective speech detection value
products;

means for summing the speech detection value
products to generate speech detection value sums, and

means for smoothing the speech detection value
sum.

3. The apparatus defined in claim 2 
CHARACTERIZED IN THAT 

said smoothing means comprises means for summing 
each speech detection value sum with a delay value to 
generate a speech detection statistic output value, the 
delay value being the product of a detection constant and a 
previous detection statistic output value. 

4. The apparatus defined in claim 3
CHARACTERIZED BY

means for computing speech energy threshold
values and speech silence threshold values based on the
voltage step size values for one of the subbands.

5. The apparatus defined in claim 4 wherein said 
speech statistic value computing means is 

CHARACTERIZED BY 

means for generating a speech presence threshold 
value and a speech silence value from a maximum energy 
level value, the maximum energy level value being generated 
by choosing the maximum of first and second energy levels, 
the first energy level being the product of a step size 
value of the low frequency subband and the second energy 
level being the larger of the previous sample maximum 
energy level value multiplied by a coefficient and a lower 
limit.

6. The apparatus defined in claim 5 
CHARACTERIZED BY 

switch means which connect either the speech 
threshold value or the speech silence value from the 
generating means to a one input of a comparator in response 
to a control signal, the other input of the comparator 
being connected to receive the speech detection statistic, 
and 

feedback means including a one-sample delay means 
connected between the output of said comparator and said 
switch for generating the control signals.

7. A method of detecting the presence of speech 
content in a signal, 

CHARACTERIZED BY 

computing a short term speech statistic from the 
step size value information of at least two of the 
subbands, and 

comparing the speech statistic to a long term 
speech energy threshold to selectively generate a speech 
presence indication signal. 

8. The method defined in claim 7 further 
CHARACTERIZED BY 

computing a long term speech energy threshold 
from the step size information of at least one of the
subbands.

9. The method defined in claim 8
CHARACTERIZED BY

giving greater weight to the higher frequency
subband than to the lower frequency subband.